An Algorithm for the Identification of Undiagnosed COPD Cases Using Administrative Claims Data

BACKGROUND: Chronic obstructive pulmonary disease (COPD) is a major cause of death in the United States, but most persons who have airflow obstruction have never been diagnosed with lung disease. This undiagnosed COPD negatively affects health status, and COPD patients may have increased health care utilization several years before the initial diagnosis of COPD is made. OBJECTIVES: To investigate whether utilization patterns derived from analysis of administrative claims data using a discriminant function algorithm could be used to identify undiagnosed COPD patients. METHODS: Each patient who had a new diagnosis of COPD during the study period (N=2,129) was matched to as many as 3 control subjects by age and gender. Controls were assigned an index date that was identical to that of the corresponding case, and then all health care utilization for cases and controls for the 24 months prior to the initial COPD diagnosis was compared using logistic regression models. Factors that were significantly associated with COPD were then entered into a discriminant function algorithm. This algorithm was then validated using a separate patient population. RESULTS: In the main model, 19 utilization characteristics were significantly associated with preclinical COPD, although most of the power of the discriminant function algorithm was concentrated in a few of these factors. The main model was able to identify COPD patients in the validation population of adult subjects aged 40 years and older (N=41,428), with a sensitivity of 60.5% and specificity of 82.1%, even without having information on the history of tobacco use for the majority of the group. Models developed and tested on only 12 months of utilization data performed similarly. CONCLUSIONS: Discriminant function algorithms based on health care utilization data can be developed that have sufficient positive predictive value to be used as screening tools to identify individuals at risk for having undiagnosed COPD.

could identify persons who are later diagnosed with COPD with statistically significant and clinically useful sensitivity and positive predictive value. To examine this hypothesis, we first captured the patterns of health care utilization found among COPD patients in the 2 years prior to their first COPD diagnosis and then compared these with those of age- and gender-matched non-COPD patients. Factors associated with preclinical COPD were then entered into a discriminant function algorithm program to determine which factors could be used to discern whether or not a given individual was likely to have COPD. The algorithm was then applied to a validation population to see if it performed with adequate sensitivity and specificity to make it a practical tool for screening purposes.

■■ Methods

Study Site
This study was conducted among members of the Lovelace Health Plan (LHP), a staff- and network-model health maintenance organization (HMO) serving New Mexico. Lovelace Health Plan is the insurance component of Lovelace Sandia Health Systems (LSHS), which also operates a network of primary care clinics, specialty centers, and hospitals. LHP served approximately 240,000 health plan members in 2001, including members of the commercial plan (approximately 700 employer groups) and managed Medicare and Medicaid plans. LSHS also serves 70,000 to 80,000 fee-for-service clients each year. As ascertained by self-report on membership surveys, LHP health plan members are 38.7% Hispanic, 55.8% non-Hispanic white, 2.1% Native American, and 3.4% other racial designations.

Algorithm Development and Validation Cohorts
All Lovelace Health Plan members who were aged 40 years or older during calendar year 2001 were randomly assigned to either the algorithm development group or the algorithm validation group. The algorithm development group was used to identify utilization characteristics associated with COPD and to develop the discriminant function algorithm. The algorithm validation group was used to test the operational characteristics of the algorithm and to assess its practical usefulness.
COPD patients in both the algorithm development and validation groups were identified using medical and pharmacy claims records. Any patient with one or more claims records associated with an International Classification of Diseases, Ninth Revision (ICD-9) code of 491.x (chronic bronchitis), 492.x (emphysema), or 496 (COPD) was designated as a COPD case. We have examined the validity of these codes and this system for identifying COPD by medical record abstraction in previous projects and found them to be accurate, with more than 95% of these patients having documented clinical evidence to support the diagnosis.18 For inclusion in the algorithm development group, we also required that COPD patients had received their first COPD diagnosis (index diagnosis) between 1997 and 2001. For a diagnosis to be considered new, there could be no diagnosis codes for COPD appearing prior to the index diagnosis, dating back to the patient's initial membership in the health plan or to January 1, 1990. We excluded any patient who also had a diagnosis of cancer (ICD-9 140-208; n = 260) or who had another chronic lung disease not associated with COPD (494.x, 405.x, and 500-519.x). Exceptions to the cancer exclusion were skin cancers, excluding melanoma (173.x); breast cancer (174.x); prostate cancer (185.x); and benign neoplasms (210-239). These were permitted since they tend to be indolent tumors with little effect on lung function. Asthma (493.x) was also permitted because of the common overlap between COPD and asthma. A total of 2,129 COPD cases meeting all inclusion and exclusion criteria were identified for the algorithm development.
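The code-based inclusion and exclusion rules described above can be expressed as a short filter. The following is only an illustrative sketch in Python, not the study's SAS code: the function names are hypothetical, ICD-9 codes are assumed to arrive as strings such as "491.21", and the code ranges are copied as printed in the text. It does not implement the first-diagnosis (1997 to 2001) requirement, which depends on claim dates.

```python
from typing import Iterable, Optional

# Code lists copied as printed in the text (3-digit ICD-9 categories).
COPD_CATEGORIES = {491, 492, 496}
PERMITTED_CANCERS = {173, 174, 185}   # non-melanoma skin, breast, prostate
OTHER_LUNG_CATEGORIES = {494, 405}    # plus the range 500-519 handled below


def icd9_category(code: str) -> Optional[int]:
    """Return the 3-digit category of an ICD-9 code, or None for V/E codes."""
    head = code.strip().split(".")[0]
    return int(head) if head.isdigit() else None


def is_copd_case(codes: Iterable[str]) -> bool:
    """True if any claim carries a 491.x, 492.x, or 496 code."""
    return any(icd9_category(c) in COPD_CATEGORIES for c in codes)


def is_excluded(codes: Iterable[str]) -> bool:
    """True if any claim carries an excluded cancer or chronic lung disease code."""
    for c in codes:
        cat = icd9_category(c)
        if cat is None:
            continue
        if 140 <= cat <= 208 and cat not in PERMITTED_CANCERS:
            return True
        if cat in OTHER_LUNG_CATEGORIES or 500 <= cat <= 519:
            return True
    return False


def eligible_case(codes: Iterable[str]) -> bool:
    """COPD case that passes the exclusion criteria (asthma, 493.x, is allowed)."""
    codes = list(codes)
    return is_copd_case(codes) and not is_excluded(codes)
```

For example, a patient with codes "491.21" and "493.90" (chronic bronchitis plus asthma) would be retained, whereas a patient with "496" and "162.9" (lung cancer) would be excluded.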
For each COPD case, we attempted to find 3 age-matched (matched to within 5 years, older or younger) and sex-matched controls who did not have a diagnosis of COPD in their claims records. Because of the advanced age of some COPD patients, it was not possible to obtain a 3:1 match for each case. Nevertheless, we were able to match 3:1 for 75% of the cohort (n = 1,602), 2:1 for 22% of the cohort (n = 461), and 1:1 for 3% of the cohort (n = 62); only 4 patients could not be matched at all. The final total for the control cohort was 5,790 patients.
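The matching step can be sketched as follows. The text does not state how controls were selected among eligible candidates, so this Python example assumes a simple greedy pass without replacement; the `Person` type and the `match_controls` function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Person:
    pid: str
    age: int
    sex: str


def match_controls(cases: List[Person], pool: List[Person],
                   max_controls: int = 3, age_tol: int = 5) -> Dict[str, List[str]]:
    """Greedily assign up to max_controls same-sex controls within age_tol years.

    Assumption: each control is used at most once (matching without replacement).
    """
    used = set()
    matches: Dict[str, List[str]] = {}
    for case in cases:
        picked: List[str] = []
        for cand in pool:
            if len(picked) == max_controls:
                break
            if cand.pid in used:
                continue
            if cand.sex == case.sex and abs(cand.age - case.age) <= age_tol:
                picked.append(cand.pid)
                used.add(cand.pid)
        matches[case.pid] = picked
    return matches


# Consistency check on the reported match counts.
cases_total = 1602 + 461 + 62 + 4           # 3:1, 2:1, 1:1, and unmatched cases
controls_total = 3 * 1602 + 2 * 461 + 1 * 62
```

The reported counts are internally consistent: the four groups sum to the 2,129 cases, and the implied number of controls equals the 5,790 reported.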

Algorithm Development
To identify factors associated with preclinical COPD, we captured all hospitalizations, outpatient encounters, and outpatient pharmacy prescription fills for 2 years prior to each patient's COPD diagnosis and during the same time period for their matched controls. The 60 days prior to the date of initial COPD diagnosis were excluded because utilization during this time period is likely to be biased toward events that led to making the diagnosis. Because of the vast number of ICD-9, Current Procedural Terminology (CPT), and American Hospital Formulary Service (AHFS) codes used in this database, it was necessary to condense some of the codes down to one descriptive term that could then be entered into the logistic regression model. Tables 1 and 2 list the diagnostic categories, related SAS variables, and the associated ranges for the ICD-9, CPT, or AHFS codes that were used in our models. We then identified the utilization characteristics that were most strongly associated with COPD using forward stepwise conditional logistic regression equations (SAS for Windows 8.2; Cary, NC). Factors that made a significant contribution to the logistic regression models were then put into the discriminant function algorithm6 (STEPDISC procedure in SAS) and run against the administrative claims for the algorithm development population. Ultimately, only those factors that contributed an R² value of 0.0015 or greater to the final algorithm were kept. An additional algorithm was also developed that was based on 1 year of utilization data.
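The observation window described above (2 years before the index date, minus the final 60 days) can be illustrated with a small Python sketch. This is not the study's SAS code. It assumes that "2 years" means 730 days, that each claim has already been condensed to a single category label, and that each category is recorded as a yes/no indicator; the category names used here are hypothetical.

```python
from datetime import date, timedelta
from typing import Dict, Iterable, Tuple

LOOKBACK_DAYS = 730   # assumption: "2 years" approximated as 730 days
BLACKOUT_DAYS = 60    # the 60 days before the index diagnosis are excluded


def in_window(event_date: date, index_date: date) -> bool:
    """True if the event falls in the lookback window but before the blackout."""
    start = index_date - timedelta(days=LOOKBACK_DAYS)
    end = index_date - timedelta(days=BLACKOUT_DAYS)
    return start <= event_date < end


def indicator_flags(events: Iterable[Tuple[date, str]], index_date: date,
                    categories: Iterable[str]) -> Dict[str, int]:
    """Build 0/1 indicators: 1 if the category occurs at least once in the window."""
    flags = {c: 0 for c in categories}
    for event_date, category in events:
        if category in flags and in_window(event_date, index_date):
            flags[category] = 1
    return flags
```

Controls would be processed with the same function, using the index date of their matched case.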

Identification and Characterization of Exacerbations
Due to the profound impact that exacerbations have on overall utilization in COPD and the possibility that repeated exacerbations may lead to the diagnosis, we examined whether identifying exacerbation events could improve the overall performance of the algorithm. Exacerbations were defined as any inpatient or outpatient encounter with a primary diagnosis of a respiratory system disease (ICD-9 codes 462.x-519.x) that was also associated with a prescription fill for an antibiotic or respiratory medication.
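The exacerbation definition can be sketched in Python as below. The text does not specify how close in time a prescription fill must be to count as "associated with" an encounter, so the `max_gap_days` parameter and its default are purely hypothetical, as are the function names and the drug class labels.

```python
from datetime import date
from typing import Iterable, Tuple

# Hypothetical labels for the qualifying drug classes.
QUALIFYING_CLASSES = {"antibiotic", "respiratory"}


def is_respiratory_primary(code: str) -> bool:
    """True if the primary diagnosis falls in ICD-9 462.x-519.x (range as printed)."""
    head = code.strip().split(".")[0]
    return head.isdigit() and 462 <= int(head) <= 519


def is_exacerbation(primary_dx: str, encounter_date: date,
                    fills: Iterable[Tuple[date, str]],
                    max_gap_days: int = 7) -> bool:
    """Respiratory encounter with a qualifying fill within max_gap_days (assumed)."""
    if not is_respiratory_primary(primary_dx):
        return False
    return any(
        drug_class in QUALIFYING_CLASSES
        and abs((fill_date - encounter_date).days) <= max_gap_days
        for fill_date, drug_class in fills
    )
```

Under these assumptions, an acute bronchitis visit (466.0) followed the next day by an antibiotic fill would count as an exacerbation, while the same visit with no qualifying fill nearby would not.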

Analyses: Algorithm Validation Phase
The algorithm was then applied to the validation group's 1998 and 1999 claims records to test its sensitivity, specificity, and positive predictive value as compared with the clinical diagnosis. Two-by-two tables were created, with the claims diagnosis considered to be the "gold standard" for comparison with the algorithm's selection results. Sensitivities, specificities, and positive predictive values were calculated using the STEPDISC program.

Medical Record Review
As an additional validation measure and to help estimate the practical usefulness of the algorithm, we abstracted the medical records of 200 patients from the validation group whom the algorithm had identified as likely to have COPD but who had never had the clinical diagnosis. Also, to help understand why the algorithm failed to identify some COPD patients, we abstracted the records of another 200 patients from the validation group who had a clinical diagnosis of COPD but whom the algorithm did not identify as at-risk patients. All records were selected at random and abstracted by an experienced abstractor using a standardized instrument. Specific clinical information that would support the diagnosis of COPD included documentation of chronic respiratory complaints (e.g., dyspnea, cough, wheezing, or more than 2 bouts of bronchitis within 12 months), spirometry showing airflow obstruction, chest radiographs with changes consistent with COPD, or a history of cigarette smoking. A patient with documentation of at least 2 of these findings was considered to be a likely COPD case.

■■ Results
In a stratified comparison of utilization, prescription, and exacerbation factors between the COPD cases and controls, several areas of increased utilization can easily be identified (Table 3). As expected, tobacco use is a significant factor, but, unfortunately, the "V" codes identifying tobacco use are not used routinely in this health system. The relatively low use of respiratory medications (38.1%) among the COPD cases was not unexpected since this is a time period prior to the first diagnosis of COPD. Respiratory symptoms, episodes of bronchitis, and use of chest radiographs were common among the COPD patients, as were other diseases associated with smoking, such as cardiovascular disease. Results from the main logistic regression model are depicted in Figure 1. Note that a history of tobacco use makes a large contribution to the model, even though only a small proportion of patients in the COPD group (13.4%) were ever given this code. Exacerbations were not significantly associated with COPD after inclusion of the other clinical factors in the model, so our system for identifying exacerbations was not entered in the final algorithm. Having multiple visits or pharmacy fills was no more predictive of having COPD than having just 1, so we did not use indicators of high utilization in any specific area as separate predictors in the model. The predictive ability of the logistic regression model was relatively good (percent concordance: 71.1) as were the model fit statistics (Wald chi-square 968, 30 degrees of freedom; P < 0.001). Only the 19 factors that were significantly associated with COPD in the logistic regression model were entered into the main discriminant function algorithm (Table 4). When applied to the algorithm development population, the sensitivity of the model was 44.7% and specificity 85.8%.
Exclusion of factors that contributed only 10 or less to the F value (i.e., the last 6 factors) had very little effect on the model's sensitivity and specificity (44.5% and 85.5%, respectively). In a second model developed using only data from the 12 months prior to the diagnosis, 9 of the 10 leading factors included in the model were also among the 10 leading factors in the original model (Table 5). When applied to the algorithm development population, this algorithm's sensitivity was 42.8% and specificity 84.6%. When this model was retested after excluding the variable for tobacco use, the sensitivity fell slightly to 41.0% and specificity to 83.9%.

Algorithm Validation
The algorithm was then applied to the validation population with 2 years of utilization data (Table 6a). Of a total population of 41,428 adults aged 40 years and older, 2,240 out of 3,704 COPD patients (60.5%) were correctly identified, with a positive predictive value of 25%. When the algorithm was applied only to persons aged 65 years and older, the sensitivity improved to 64% and positive predictive value to 38% (Table 6b). When the model was applied to the validation population with only 12 months of cumulative utilization data, its performance was only slightly less than that seen in the 2-year population (Tables 7a and 7b).
When we excluded tobacco use from the algorithm and reapplied it to the validation population, the effects were very minor. When applied to the validation population with only 12 months of utilization data and restricted to persons aged 65 years and older, the sensitivity fell only from 60.5% to 59.9%, and the positive predictive value declined from 38.6% to 37.6%. Although it would most likely be advantageous to have tobacco use history on all patients, the algorithm is able to identify more than half of the COPD patients in this group even without any tobacco history information.
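The headline validation figures can be reproduced from the counts reported in this article: 41,428 adults, 3,704 COPD patients, 2,240 correctly identified, and 9,010 persons flagged by the algorithm (the last figure is given in the Conclusion). The Python sketch below is a simple arithmetic check with a hypothetical helper name, not the STEPDISC output.

```python
from typing import Dict


def two_by_two(total: int, cases: int, flagged: int, true_pos: int) -> Dict[str, float]:
    """Derive sensitivity, specificity, and PPV from marginal counts."""
    false_neg = cases - true_pos              # COPD patients the algorithm missed
    false_pos = flagged - true_pos            # flagged persons without a COPD claim
    true_neg = total - cases - false_pos      # non-COPD persons not flagged
    return {
        "sensitivity": true_pos / (true_pos + false_neg),
        "specificity": true_neg / (true_neg + false_pos),
        "ppv": true_pos / (true_pos + false_pos),
    }


metrics = two_by_two(total=41428, cases=3704, flagged=9010, true_pos=2240)
for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```

The computed values agree with the reported sensitivity of 60.5%, specificity of 82.1%, and positive predictive value of about 25%.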

Results From the Medical Record Review
Of 200 patients who were identified by the algorithm as likely to have COPD but who did not have a clinical diagnosis, 55 (27.5%) had at least 2 types of evidence supporting a diagnosis of COPD in their medical records. These tended to be persons who were smokers and who were treated one or more times for respiratory infections. Conversely, of 200 COPD patients whom the algorithm classified as not having COPD, 69 (34.5%) did not have at least 2 types of evidence supporting the diagnosis. These tended to be persons who did not appear to be very compliant with medication regimens or follow-up to treatment.

■■ Discussion
Our study shows that it is possible to create a predictive algorithm that uses routinely collected administrative data to identify persons who may have preclinical or undiagnosed COPD. The algorithm works even with very little or no information on tobacco use in the database. We believe that this algorithm could be used as part of an efficient and effective secondary health intervention system to identify persons with possible undiagnosed COPD and refer them for appropriate work-up and treatment.
We examined a variety of factors that affected the performance of the algorithm. The biggest limitation of this system is that the sensitivity depends mostly on the patient having health care utilization for COPD or other tobacco-related conditions. A substantial proportion of the COPD population in the present study did not have such utilization. For example, antibiotic use, the most common factor among the COPD patients, was found in only 41.2% of the COPD patients in the 2 years prior to diagnosis. Improved sensitivity allows one to identify more patients who have COPD or who are at risk for the disease. As patient age increases and overall health care utilization also increases, the sensitivity of the algorithm improves. Nevertheless, the algorithm's sensitivity among persons aged 40 to 49 years in the validation group was 49.9%. Thus, this algorithm can efficiently identify many persons with relatively early disease and help get their COPD diagnosed and treated before severe lung disease and permanent disability have set in.
One argument against early case finding in COPD is the lack of interventions proven to change the course of the disease. The Lung Health Study has shown that COPD patients who are provided smoking cessation counseling and who manage to abstain from cigarettes for at least 5 years have significantly improved probability of survival.19 To date, no pharmacologic intervention has been proven to improve survival or slow the accelerated airflow obstruction that is characteristic of COPD.

Limitations
There are limitations to this study that should be noted before application to other cohorts. The clinical characteristics of the Lovelace Health Plan COPD population are likely to be at least slightly different from those found elsewhere, and the practice habits of Lovelace Health System physicians are also likely to be different. Because the algorithm is based on specific utilization characteristics, this algorithm is likely to perform somewhat differently in other health care systems. The positive predictive value of any test depends on the prevalence of the target disease in the study population, so cohorts that have a lower prevalence of COPD than ours can be expected to have a lower positive predictive value. The algorithm had a specificity in the range of 57% to 64%, which indicates that the majority of patients without COPD were appropriately classified, yet a substantial proportion of patients without disease would be classified as being at risk. Our tests of the various factors affecting the algorithm's performance suggest that it is sufficiently robust for application in other managed care systems; however, further validation of this algorithm in other systems is warranted. A sensitivity of 40% to 64% and specificity of 71% to 87% would generally not be considered adequate to support the routine use of the algorithm as a screening test. However, we do not suggest that this algorithm be applied in the same way that most clinical screening tests are applied. This algorithm should be viewed simply as a tool that can efficiently identify a large number of persons who have an increased risk of having a debilitating and progressive respiratory condition. Tests with positive predictive values in the 20% to 50% range can be very effectively applied in early screening programs, but whether or not this level of efficiency is adequate depends on judgments about the impact of the disease, the usefulness of early intervention, and the costs of screening.
It is likely that addition of a second screening test to this algorithm, particularly information about the history of tobacco use, could significantly improve the overall positive predictive value. Application of this algorithm will certainly not replace current recommendations that all adults who have smoked 1 pack of cigarettes per day for 10 years or who have other risk factors for COPD have spirometry performed to document evidence of airflow obstruction.8,9 Application of the algorithm can, however, help identify those persons who are suffering the effects of COPD earlier and direct them toward appropriate therapy.
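The dependence of positive predictive value on prevalence noted in this section follows from Bayes' rule. The short Python sketch below illustrates it using the validation sensitivity (60.5%) and specificity (82.1%) and the validation prevalence implied by the reported counts (3,704 of 41,428). The 4% prevalence used for comparison is a hypothetical value chosen only for illustration.

```python
def ppv(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Positive predictive value from sensitivity, specificity, and prevalence."""
    true_pos_rate = sensitivity * prevalence
    false_pos_rate = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos_rate / (true_pos_rate + false_pos_rate)


# Prevalence implied by the validation cohort counts reported in this article.
study_prevalence = 3704 / 41428

ppv_study = ppv(0.605, 0.821, study_prevalence)   # close to the reported 25%
ppv_lower = ppv(0.605, 0.821, 0.04)               # hypothetical 4% prevalence
```

With sensitivity and specificity held fixed, the lower-prevalence cohort yields a noticeably lower positive predictive value, which is why performance in other health plans may differ from that observed here.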

■■ Conclusion
We believe that this algorithm works well enough for practical application but, ultimately, the success of the algorithm will depend on how well it works when applied as part of a program for identifying and managing persons at risk for undiagnosed COPD. Application of the algorithm to a database is relatively simple and only 12 to 24 months of continuous data are needed. Of the 9,010 persons in the validation group that the algorithm identified as being likely to have COPD, 2,240 were confirmed cases (PPV = 25%). Hence, it may be useful to further screen persons through the use of respiratory-symptom questionnaires and history of tobacco use before referring them for spirometry. We also note that we did not match on race because race and ethnicity are not variables in the database that was used in the present study.
There are many practical issues that still must be considered, such as how to appropriately approach patients who have been identified as being at risk, how to combine the results of the algorithm with information about tobacco history, and how best to communicate this information to the patient's primary care provider. COPD is a growing problem, especially among women, and to reverse this trend, we must find innovative ways to identify patients at risk for the disease and at earlier stages. Most managed care systems routinely collect the data on which this algorithm is based, so we strongly recommend that they consider using this approach as part of a program to improve COPD care for their patients.